ABSTRACT


This thesis investigates the application of AutoRegressive (AR) modeling techniques
to single syllable words to detect foreign accents in spoken American English. The study
involves thirty-one native American English speakers and six native Brazilian speakers.
Five different distance measures are used for classification. Results show that correct
classification is obtained for 88% of the native English speakers and 80.5% of the
non-native (foreign) English speakers.


I. INTRODUCTION


The goal of accent recognition investigated in this thesis is to automatically detect 
non-native (foreign) English speakers as foreign, and native American English speakers as 
native. Automatic recognition refers to the ability to detect foreign accents using 
computers or machines. The detection of foreign accents by ear is common practice. 
However, the automatic detection of foreign accents is difficult because the frequency
content of normal speech varies with time, and because loudness and speaking rate can
bias the measurements.

This thesis considers the use of a few single syllable words common in daily speech. A
normalization technique limits the effects of loudness and speaking rate.
This study focuses on one group of non-native English speakers, with the notion that the
techniques used for accent detection may be extended to recognize non-native English
speakers from many languages. The group selected for this study consists of Brazilian
students attending the Naval Postgraduate School. The word list used is made up of
words that are difficult for native Brazilians to pronounce. This word list selection
process is based on the idea that "You Can't Teach Old Dogs New Tricks" [1, 2] and that
the sounds used in native American English that differ from the sounds used by
native Brazilians will be mispronounced more often. The native English speakers used in
this study are originally from various regions of the United States and are all military
servicemen, which limits regional accent because of the many areas in which they have
traveled and resided. The techniques described in this study may enhance the ability to
recognize foreign accents and enable language schools to test student accents automatically.
Additionally, the ability to recognize foreign accents has broad military use.

The remainder of this thesis is organized as follows: Chapter II introduces speech 
analysis and presents a brief introduction to phonetic concepts. Chapter III introduces 
AutoRegressive (AR) modeling. Chapter IV presents the method of data collection, 
preparation, and normalization used for this study. Chapter V presents the premise of 
foreign accents and word list selection. Chapter VI introduces the performance measures 
used to test the various speakers. Results are presented in Chapter VII. Finally, Chapter
VIII presents conclusions and recommendations for future research.


II. SPEECH ANALYSIS


This chapter first explains how speech may be divided into individual sounds and 
combinations of sounds. Next, speech terminology is introduced with brief explanations 
and definitions [3]. Finally, speech analysis techniques used to obtain information from 
speech signals are presented. 

Speech signals have a special quality that most other signals do not have: their
contents are usually recognizable to the listener even if the listener does not know what to
expect. In addition, the quality or noisiness of the signal is usually immediately apparent,
while the quality of other signals, such as tones or groups of tones, would not be as
apparent to the lay listener. Speech is made up of many sounds created by many different
mechanisms of articulation. Although every person sounds a little different, and even
though there are many accents in normal speech, the various speech signals are still
understandable among speakers of the same language. Linguistics is the scientific study
of language and of the manner in which its rules are used in human communication. The
study of the abstract units and their relationships in a language is called phonemics,
while the study of the actual sounds of the language is called phonetics.
Phonemes are the basic theoretical units for describing how speech conveys linguistic
meaning (for example, the word "man" consists of the three phonemes /m/, /æ/, /n/).
The English language has forty-two phonemes, which are listed in Table 1 [4]. English, in
this study, refers to American English. Phonemes are defined as theoretical or ideal
sounds, and if every speaker produced these ideal phonemes, English speech would be a
simple combination of the phonemes. Phones are defined as the actual sounds produced
by speakers, which lead to the understanding of the intended meaning of the sounds. A
phoneme spoken individually is simple to identify; however, when phonemes are spoken in
normal speech, the beginnings and ends of phonemes are very difficult to identify. In
addition, phoneme sounds may interact with each other. In normal speech, there are
transition periods between phonemes where slight acoustic variations occur. Therefore,
each phoneme has an associated group of transitional phone variations called
allophones.


TABLE 1: AMERICAN ENGLISH PHONEMES [Carrell, J., and Tiffany, W., Phonetics:
Theory and Application to Speech Improvement, McGraw-Hill, 1960, reproduced with
permission from the publisher]

Class                 Key words
Front vowels          heed [hid], hid [hɪd], hayed [hed], head [hɛd]
Back vowels           who'd [hud], hood [hʊd], hoed [hod], hawed [hɔd], hod [hɑd]
Central vowels        hurt [hɝt], hut [hʌt], under [ʌndɚ], about [əbaʊt]
Diphthongs            file [faɪl], fowl [faʊl], foil [fɔɪl], fuel [fjul]
Stops                 pen [pɛn], Ben [bɛn], ten [tɛn], den [dɛn], Kay [ke], gay [ge],
                      chew [tʃu], Jew [dʒu]
Fricatives            few, view, thigh, thy, shay, bays, beige
Nasals and lateral    some [sʌm], sun [sʌn], sung [sʌŋ], lay [le]
Glides                whey [hwe], yea [je], ray [re], way [we]

The basic phonemes in speech are made up of vowels (front, back, central),
semivowels, diphthongs, fricatives, affricates, stops, glides, and nasals. Speech is also
classified as voiced or unvoiced. Voiced and unvoiced speech can be separated using a
combination of two speech analysis techniques: the zero crossing measure and the
short-term energy measure [3]. The zero crossing measure counts the number of times
a sequence changes sign, and the short-term energy measure is used to determine where
the majority of the energy of the sequence is located. Unvoiced speech usually consists of
high frequency sounds that have large numbers of zero crossings, and voiced speech
normally contains the majority of the energy. Since the zero crossing measure and the
short-term energy measure identify these characteristics, their combination makes it
possible to separate voiced and unvoiced speech. Voiced speech consists of sounds that are
created with a vocal note or sonant (as in the vowel sound in "sat", phonetically spelled
[sæt]). Vowel sounds are quasi-periodic, and their period is known as the pitch period.
Unvoiced speech consists of sounds that are whispered or created without a vocal note (as
in the consonant sounds in "sat"). The classes of vowels get their names from how they are
articulated, or how the tongue is used to produce a sound. Semivowels are vowel-like
sounds not caused by vowels (the m and n sounds in "man"). Diphthongs are sometimes
called long vowels; however, they are actually the sounds created when transitioning from
one vowel sound to another in a continuous fashion (as in "being" or "seeing"). Fricatives
are voiced or unvoiced noise-like sounds used in speech (for example, /z/ and /v/ are voiced
phonemes while /s/ and /f/ are unvoiced phonemes). Stops or plosives are consonant sounds
that are normally aspirated in English, and where a release of air under pressure
accompanies the sound (for example: b, d, g, p, t, k). Affricates are formed by
transitioning from a stop to a fricative (as in "church" and "John"). Continuant sounds
like vowel sounds are quasi-periodic, and their frequency components can be captured using
techniques that rely on stationarity. Time-varying sounds like those found in diphthongs
and semivowels are non-stationary and are classified as noncontinuant.
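
The zero crossing and short-term energy measures described above can be sketched as
follows. This is only an illustrative sketch in Python, not the procedure of [3]: the
function names, the decision rule, and both thresholds are assumptions introduced here.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples of a frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def short_term_energy(frame):
    """Sum of squared samples in the frame."""
    return sum(s * s for s in frame)


def is_voiced(frame, max_crossings, min_energy):
    """Hypothetical decision rule: voiced frames have high energy and few
    zero crossings. Both thresholds are illustrative assumptions."""
    return (short_term_energy(frame) >= min_energy
            and zero_crossings(frame) <= max_crossings)


# Toy frames: a slow, strong oscillation versus a weak, rapidly alternating one.
voiced_like = [0.9, 0.8, 0.5, -0.4, -0.9, -0.7]
unvoiced_like = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
print(is_voiced(voiced_like, max_crossings=2, min_energy=1.0))    # True
print(is_voiced(unvoiced_like, max_crossings=2, min_energy=1.0))  # False
```

In practice the thresholds would have to be tuned to the frame length and to the
recording level of the data.
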

Figure 1 shows the recorded speech signal "being". Figure 2 shows the frequency
spectrum of the same signal. The time-frequency spectrogram of "being" is
shown in Figure 3, where the time increment is 3.4 milliseconds and the Fast Fourier
Transform (FFT) length is 512. The time-varying voiced components of the frequencies
are obvious in Figure 3. Thus, there are definite advantages to looking at both the
frequency spectrum and the spectrogram of a signal: the strong center frequency
components of the vowel-like sounds are identified, as well as the variation in frequency
over time which causes different sounds.

Figure 1: Recorded speech signal "being", sampling frequency fs = 8192 Hz.

Figure 2: Frequency spectrum of the recorded speech signal "being".

Figure 3: Time-frequency spectrogram of the recorded speech signal "being", where the time
increment is 3.4 ms and the Fast Fourier Transform (FFT) length is 512.

Note that English is a spoken language as opposed to a written language, meaning that 
groups of letters are not always pronounced the same way. As a result, symbols are used 
to express phoneme sounds. These symbols used to represent speech sounds are called 
phonetic symbols, and with these symbols the English language may be represented as a 
written language. There are several phonetic alphabets used for English pronunciation. 
Table 2 shows five common alphabets [4]. 


TABLE 2: FIVE COMMON AMERICAN ENGLISH PHONETIC ALPHABETS

Most dictionaries include a phonetic pronunciation with each word. The phonetic 
alphabet used in this study is the International Phonetic Alphabet (IPA). 

The voiced speech phonemes are quasi-periodic. This quasi-periodic nature is shown in
Figure 4, a blown-up section of Figure 1. The fundamental period of the waveform shown in
Figure 4, T = 7.5 ms, is called the pitch period. The nominal center frequencies of
the resonances present in the voiced speech phonemes are called formant frequencies, or
formants. These frequencies represent normal speech, and in this case are considered
theoretical or ideal frequencies. The first three formant frequencies for a voiced phoneme
are normally labeled F1, F2, and F3. Table 3 shows some basic voiced phonemes and their
associated average adult male formant frequencies [5].
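
As a quick check of scale, the pitch period quoted for Figure 4 corresponds to the
fundamental frequency worked out below. The symbol F_0 is introduced here only for
illustration and is not used elsewhere in the text.

```latex
F_0 = \frac{1}{T} = \frac{1}{7.5 \times 10^{-3}\ \text{s}} \approx 133\ \text{Hz}
```

This value lies well below the frequency range considered for the formants in this study.
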


TABLE 3: AVERAGE MALE FORMANT FREQUENCIES


Refer to Table 1 for examples of the sounds listed in Table 3. Note that male and female
formant frequencies are very different on average. Therefore, to eliminate problems due to
gender differences, this study uses only adult male voices. In addition, age may also create
some frequency discrepancies in formants, especially between children and adults. The
database generated for this study uses only males ranging in age from twenty-eight to
forty, with an average age of thirty-one. The table of formants (Table 3) is by no means
exhaustive and does not begin to represent the phones or allophones. Figure 5 shows the
frequency spectrum of the phoneme /æ/ produced by a native English speaking male.
Recall that the first three ideal average formant frequencies for the phoneme /æ/ are
located at 660 Hz, 1720 Hz, and 2410 Hz; they are indicated on Figure 5. Note that this
speaker's second and third formant frequencies F2 and F3 are higher than the
representative F2 and F3 averages.

Figure 4: Blown-up section of Figure 1 showing the quasi-periodic nature of voiced speech
phonemes. The pitch period is T = 7.5 ms.

Figure 5: Frequency spectrum of the phoneme /æ/ produced by a native English speaking male.





III. AUTOREGRESSIVE MODELS


System modeling has been investigated extensively in signal processing applications.
Models can be used for various applications such as prediction, forecasting, and data
compression. One of the most used linear models is the AutoRegressive (AR) model [6].


In the AR model a signal $x(n)$ is considered to be the output of some system with input
$w(n)$, where $w(n)$ is white noise with power $\sigma_w^2 = 1$. The block diagram of the
system is given by:

    w(n) --> [ AR system ] --> x(n)

and the difference equation is given by:

$$x(n) = -a_1 x(n-1) - a_2 x(n-2) - \cdots - a_P x(n-P) + b_0 w(n). \tag{1}$$

The coefficients $a_k$ for $k = 1, \ldots, P$, and $b_0$, are the parameters of the system,
and $P$ is the order of the AR model. The frequency domain expression obtained from the
Z transform of (1) is given by:

$$X(z) = -a_1 z^{-1} X(z) - a_2 z^{-2} X(z) - \cdots - a_P z^{-P} X(z) + b_0 W(z). \tag{2}$$

Collecting like terms in (2) leads to:

$$X(z)\left[1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_P z^{-P}\right] = b_0 W(z).$$

Let us define the polynomial:

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_P z^{-P}.$$

For this study $w(n)$ is white noise with variance equal to one, and the resulting transfer
function of the AR model is given by:

$$H(z) = \frac{X(z)}{W(z)} = \frac{b_0}{A(z)}.$$

The AR coefficients can be obtained by solving a set of linear equations obtained from
equation (1). Using the properties of the AR model, the correlation function $R_x(l)$
obtained from $x(n)$ is given by:

$$R_x(l) = -a_1 R_x(l-1) - \cdots - a_P R_x(l-P) + b_0 R_{wx}(l),$$

which leads to:

$$R_x(l) + a_1 R_x(l-1) + \cdots + a_P R_x(l-P) = b_0 R_{wx}(l). \tag{3}$$

The cross correlation $R_{xw}(l)$ can be expressed in terms of the impulse response $h(n)$
of the AR system:

$$R_{xw}(l) = h(l) * R_w(l). \tag{4}$$

Recall that the correlation function of white noise is expressed as:

$$R_w(l) = \sigma_w^2 \delta(l). \tag{5}$$

Thus, substituting (5) into equation (4) leads to:

$$R_{xw}(l) = h(l) * \sigma_w^2 \delta(l) = \sigma_w^2 h(l),$$

which leads to:

$$R_{wx}(l) = \sigma_w^2 h^*(-l). \tag{6}$$

Next, substituting equation (6) into equation (3) leads to:

$$R_x(l) + a_1 R_x(l-1) + \cdots + a_P R_x(l-P) = b_0 \sigma_w^2 h^*(-l).$$

Here $h(n)$ is the impulse response of a causal filter, where a causal system produces
output values which are expressed in terms of past and present input values only. Thus,
$h(n)$ for $n < 0$ is equal to 0. Next, using the Initial Value Theorem, we have:

$$h(0) = \lim_{z \to \infty} H(z) = \lim_{z \to \infty} \frac{b_0}{1 + a_1 z^{-1} + \cdots + a_P z^{-P}} = b_0,$$

therefore:

$$b_0 R_{wx}(l) = \begin{cases} |b_0|^2 \sigma_w^2 & \text{for } l = 0, \\ 0 & \text{for } l > 0. \end{cases}$$

Expressing (3) for $l = 0, \ldots, P$ leads to the following system of linear equations
known as the Yule-Walker equations:

$$\begin{bmatrix} R_x(0) & R_x(-1) & \cdots & R_x(-P) \\ R_x(1) & R_x(0) & \cdots & R_x(-P+1) \\ \vdots & \vdots & \ddots & \vdots \\ R_x(P) & R_x(P-1) & \cdots & R_x(0) \end{bmatrix} \begin{bmatrix} 1 \\ a_1 \\ \vdots \\ a_P \end{bmatrix} = \begin{bmatrix} \sigma_w^2 |b_0|^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} |b_0|^2 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.$$
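
One common way to solve the Yule-Walker equations for a real-valued sequence is the
Levinson-Durbin recursion. The sketch below is written in Python for illustration only;
it is not the MATLAB™ code of Appendix A, and the function names and the biased
autocorrelation estimate are assumptions made here. The white noise variance is taken
equal to one, as in the text, so the final prediction error equals the squared gain.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation estimates R_x(0..max_lag) of a real sequence."""
    n = len(x)
    return [sum(x[i] * x[i - lag] for i in range(lag, n)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations for a real sequence.

    r -- autocorrelation values R_x(0), ..., R_x(order)
    Returns (a, gain_sq) with a = [1, a_1, ..., a_P] and gain_sq = |b_0|^2
    (white-noise variance assumed equal to one).
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        # Reflection coefficient for the step from order m-1 to order m.
        acc = r[m] + sum(a[k] * r[m - k] for k in range(1, m))
        refl = -acc / err
        prev = a[:]
        for k in range(1, m):
            a[k] = prev[k] + refl * prev[m - k]
        a[m] = refl
        err *= 1.0 - refl * refl
    return a, err


# Exact autocorrelation of x(n) = 0.5 x(n-1) + w(n), i.e. a_1 = -0.5, b_0 = 1.
r_exact = [(4.0 / 3.0) * 0.5 ** lag for lag in range(3)]
coeffs, gain_sq = levinson_durbin(r_exact, order=2)
print(coeffs, gain_sq)  # approximately [1.0, -0.5, 0.0] and 1.0
```

For recorded words the autocorrelation values would come from `autocorrelation` applied
to the prepared data sequence, with `order=24` as used in this study.
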


Figure 6 shows the spectral response obtained from an AR model of order P = 24
for the speech signal "being", superimposed on the FFT spectrum of the same
signal. The spectrum of an AR model is the magnitude of the frequency response of the
AR model's transfer function. Note that the AR model approximates the high-energy
portions of the spectrum, which are due to "pole-like" behavior, more closely
than it approximates the portions of the spectrum where the energy is lower.
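
The AR spectrum described above can be evaluated directly from the coefficients by
sampling the transfer function on the unit circle. The following Python sketch is
illustrative only: the function name, the frequency grid from 0 up to (but not including)
half the sampling frequency, and the first-order example are assumptions made here.

```python
import cmath
import math


def ar_spectrum(a, b0, n_points):
    """Magnitude of H = b0 / A evaluated on the unit circle.

    a        -- [1, a_1, ..., a_P], coefficients of A(z)
    n_points -- number of frequencies, evenly spaced on [0, pi)
    """
    spectrum = []
    for i in range(n_points):
        omega = math.pi * i / n_points
        # A(e^{j omega}) = sum_k a_k e^{-j omega k}
        a_val = sum(ak * cmath.exp(-1j * omega * k) for k, ak in enumerate(a))
        spectrum.append(abs(b0 / a_val))
    return spectrum


# First-order example: A(z) = 1 - 0.5 z^{-1} gives a "pole-like" peak at 0 Hz.
mags = ar_spectrum([1.0, -0.5], b0=1.0, n_points=512)
print(mags[0])  # 2.0, since |1 / (1 - 0.5)| = 2
```
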

The vowel sounds contained in words are quasi-periodic voiced components. The
vowel frequencies contain the majority of the power in a single spoken word. The
assumption here is that for single short words the most distinguishable components are
the vowel-like sounds, and therefore the overall AR model of a word is a "good"
representation. However, note that the AR model represents in some sense the "average"
frequency information contained in the word; the non-stationary information present in
the word cannot be represented by constructing the AR model of a full word. Results show
that differences between the AR models of single syllable full words and the AR models of
the voiced phonemes present in those words are very small. Figure 7 shows
the closeness of the AR models for the phoneme /æ/ and the full word "sat", which
contains the phoneme /æ/, where an AR model of order twenty-four is used. Full word AR
models are much easier to produce since no word segmentation is required. For the
purpose of this study full word AR models are used. The AR model can be used to locate
the first three formants F1, F2, and F3. The formants, however, are not equally energy
weighted. The lower frequencies usually contain more energy than the higher frequencies,
and therefore only the frequency range where the first two formants are usually located is
considered in this study. Table 3 shows that for any vowel the highest frequency for the
second formant is around 2290 Hz, associated with the phoneme /i/. As a result, this study
is restricted to the frequency range from 0 to 2400 Hz to consider the effects due to the
first two formants only.

Figure 6: AR (smooth line) and FFT (jagged line) spectra of the recorded speech signal "being".
The correlation method is used to compute the AR model; the AR model order is P = 24.

Figure 7: AR spectra for the phoneme /æ/ and the full word "sat". The AR models are computed
using the correlation method; the AR model order is P = 24.

The order of the AR model was determined heuristically through experimentation.
Table 3 is used, and AR models representing the words containing the phonemes of
interest are produced to express the formant frequencies. An order of twenty-four is high
enough to represent the spectral information contained in the words. This order may
appear to be large; however, it allows enough detail to be represented, while a lower
order model may cause more information to be lost. Figure 8 shows the twelfth order AR
model of the word "girl", and Figure 9 shows the twenty-fourth order AR model of the same
sequence. Comparing the models obtained for orders twelve and twenty-four in Figures 8
and 9 shows that more details are represented with the higher order model.


The MATLAB™ implementation of the AR spectra is presented in Appendix A.

Figure 8: 12th order AR spectrum of the recorded speech signal "girl", correlation method used
to compute the AR model.

Figure 9: 24th order AR spectrum of the recorded speech signal "girl", correlation method used
to compute the AR model.





IV. DATA PREPARATION 


A group of thirty-one male native English speakers is recorded to provide a native
English speaking model and a native English speaking test group. The thirty-one English
speakers are divided into two groups: a model group of sixteen speakers and a test group
of fifteen speakers. Thirty-four English speakers were initially recorded. However, three
were eliminated due to over-modulation, under-modulation, or an unexplained anomaly in
the recordings. The resulting set of thirty-one native English speakers has an average age
of thirty-one.

A second group is formed of non-native English speakers consisting of six male 
Brazilians with a native language of Portuguese. All Brazilian non-native English speakers 
are students at the Naval Postgraduate School pursuing graduate degrees. The Brazilian 
group has an average age of thirty-two, and on average all the individuals in that group 
have spoken English for more than thirteen years. 

The software package used for numeric computation and graphics is MATLAB™. 

A. RECORDINGS 

All native and non-native English speakers are recorded in the same way. A Sun
Sparc-10 workstation with an audio tool is used to directly record a list of fourteen
spoken English words. Each speaker is recorded in the same room using identical
equipment. The instructions given to the speakers are to relax, speak using their normal
voice, and pause momentarily between words to make word segmentation easier. The
word list is reviewed by each speaker before the recording is started to ensure that every
word on the list is understood. Each speaker is recorded saying the list of words twice.
After the word list is recorded for the first time, the data file is saved and the process
is repeated. The word lists are digitized as recorded using a sampling frequency of
8192 Hz.


B. WORD SEPARATION 

The process of data preparation begins with loading each data sequence, a list of
fourteen spoken words, into MATLAB™. The word list is then plotted and cut into
individual words by visual inspection. Each word is saved as a separate data file, and
excess non-speech is trimmed from each spoken word. When cutting and trimming are
completed, the separated list of fourteen words is saved as a data set. The resulting word
data sequences consist of single spoken words with little excess silence before or after the
word. Each speaker contributes two complete sets of data, one from each of the two times
the word list is recorded.

C. FILTERING AND NORMALIZING 
Each word data file is filtered and normalized before any processing begins. 
1. Filtering 
The normal speech frequency range is between 100 and 4000 Hz. A high-pass
Finite Impulse Response (FIR) filter [7] with a cutoff frequency equal to 100 Hz is
designed to eliminate the sixty Hertz equipment noise. Figure 10 shows the frequency
response of the high-pass filter. A low-pass Butterworth filter [7] is designed to eliminate
all the frequencies above 4000 Hz. Figure 11 shows the frequency response of the
low-pass filter.

Figure 10: High pass, 48th order, FIR (Finite Impulse Response) filter with pass band frequency
equal to 100 Hz. The sampling frequency is fs = 8192 Hz.

2. Normalization 

To allow a standard comparison between the spoken words, an energy
normalization is required. Note that each word is spoken at a different loudness level and
over a different period of time, as some speakers speak faster than others. Even though the
environment of the recordings is constant, the speakers are very different. Each word data
sequence has its mean removed before filtering, between the two filters, and after the
filtering is complete. To minimize the effects of loudness and of variations in time or
sequence length of the recorded speech, the following normalization is used:

$$N_{data} = \frac{F_{data}}{\left(F_{data}^{T} F_{data}\right)^{1/2}},$$

where $N_{data}$ is the normalized word data sequence, and $F_{data}$ is the filtered word
data sequence.

As a result, all AR models of a normalized word data sequence have the same 
energy without regard to speaker or word spoken. A check of the normalization can be 
conducted by finding the energy in the normalized word data sequence. The energy in 
each word data sequence is equal to one. 
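
The mean removal and energy normalization can be sketched as follows, assuming that the
normalization divides the filtered sequence by the square root of its energy, which is
consistent with the statement that the energy of each normalized sequence equals one. The
Python function names and the toy sequences are assumptions introduced for illustration.

```python
import math


def remove_mean(x):
    """Subtract the sample mean from a sequence."""
    mean = sum(x) / len(x)
    return [s - mean for s in x]


def normalize_energy(x):
    """Scale a sequence so that the sum of its squared samples equals one."""
    energy = sum(s * s for s in x)
    if energy == 0.0:
        raise ValueError("cannot normalize an all-zero sequence")
    scale = math.sqrt(energy)
    return [s / scale for s in x]


# Two toy "recordings" of the same shape at different loudness levels.
quiet = normalize_energy(remove_mean([0.0, 1.0, 0.0, -1.0]))
loud = normalize_energy(remove_mean([0.0, 10.0, 0.0, -10.0]))
print(sum(s * s for s in quiet))  # close to 1.0
print(sum(s * s for s in loud))   # close to 1.0
```

After normalization the two toy sequences coincide, which illustrates how the effect of
loudness is removed before the AR models are compared.
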

The result of the data preparation is a data sequence that can be comparatively
analyzed with other data sequences prepared in the same manner. The potential effects of
loudness variations and data sequence length of speech have been reduced, and the effects
of energy variations and frequency concentration have been enhanced.

Figure 11: Low pass, 8th order, Butterworth filter with cut-off frequency equal to 4000 Hz.
The sampling frequency is fs = 8192 Hz.


V. FOREIGN ACCENTS


This chapter explains how accents are established, and why non-native
English (foreign) speakers that have the same native language have similar accents when
speaking English. The foreign accent similarities existing between these non-native
speakers may potentially be used to identify an individual's native tongue and the country
where the individual was raised.

The production of a foreign accent may be caused by many different factors, including
when, where, how, and why another language is learned. The theory used in this study is
based on non-native English speakers learning English when they are well established
speakers of their native language and are no longer children.

A. ACCENT PREMISE 

The premise of limits on phonetic accuracy [1, 2] may be simply stated as the old
phrase, "You Can't Teach Old Dogs New Tricks". The speaker's native language is the
source language, and the non-native spoken language is the target language. The phones
of the speaker's native language are identified as L1 phones, and the phones of the target
language are identified as L2 phones. Foreign accents may be caused by the production
of sounds in the target language that are not used in the source language. Thus, the
sounds in the target language that are not present in the source language will be the
sounds most difficult to produce because these foreign sounds have never been used. The
premise for the production of accents is that if the target language has a sound that is
not used in the speaker's source language, then the speaker will substitute an existing
sound in the source language for the sound in the target language. If the source language
sound is similar enough to be understandable, then the non-native speaker has no immediate
incentive to improve the pronunciation of the L2 phone. The production of similar
accents by speakers with the same native language is then caused by the similar
substitution of L1 phones for L2 phones. The identification of the L1 for L2 substitution
is the key to recognizing a foreign accent. Note, however, that not every non-native
speaker learns a new language in the same way and not all L1 phones are pronounced the
same. The seemingly simple task of identifying foreign accents is actually quite
difficult to do automatically, and the difficulties in the identification process increase
as the proficiency of the non-native speaker in the foreign language improves. The more
phonemes that are not present in the speaker's source language, the easier it should be to
find accent possibilities. The idea is to start with the phonemes that are different from
those of the source language yet similar enough to cause a substitution of L1 for L2
phones, and then to look at the phonemes that had to be learned from scratch.


B. WORD LIST SELECTION 


The word list selection is accomplished by identifying the L2 phones that are most
difficult for the foreign speakers to pronounce. These phones may not be brand new
phones; they may be target phones that are just close enough to existing phones in the
speaker's source language that a substitution of L1 for L2 phones seems harmless. A
brief interview of six native Brazilian speakers revealed that the sounds most difficult
to produce were the English phonemes /æ/, /f0/, /»’ and the sound created when
pronouncing "rl" (as in "world" and "girl"). All these sounds are incorporated in the word
list used. The word list is chosen using different phonemes in similar words so that when
the similar words are spoken the only difference in the pronunciation is the phoneme of
interest. Table 4 shows the word list used for the recordings, the vowel phonemes in
the IPA, and the formant frequencies associated with each vowel phoneme.


TABLE 4: FOURTEEN-WORD LIST WITH VOWEL FORMANTS AND PHONEMES

Only the voiced vowel phonemes are shown even though the entire word for each
recording is AR modeled. Recall that differences between the AR models of the full
words and the AR models of the vowel phonemes contained in those words are very small
(see Figure 7).





VI. PERFORMANCE MEASURES AND TESTING 


This chapter presents performance tests which measure how well each speaker
pronounces the selected set of words listed in Table 4, which contain particular phonemes,
in comparison to a diversified reference group of native American English speakers. Five
performance measures using AR models obtained from given words are used. AR models
are produced using the entire word for each word on the word list in Table 4. The
frequency region for the AR models is limited to the interval 0 to 2400 Hz, as described in
Chapter III. As a result, the total number of points in each AR model sequence of order
P = 24 is N = 300, which corresponds to 2400 Hz, given that 512 points are used to
represent the AR frequency response. The five performance measures are the Itakura
distance [3, 8], the normalized cross-correlation coefficient and the modified normal
cross-correlation coefficient [9], the log spectral distance [10], and a "bounds" measure
defined in this study.
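
The value N = 300 can be checked with a short calculation, under the assumption (not
stated explicitly in the text) that the 512 points of the AR frequency response span the
range from 0 Hz to half the sampling frequency, that is, 4096 Hz for fs = 8192 Hz. The
symbol for the frequency spacing is introduced here for illustration.

```latex
\Delta f = \frac{f_s / 2}{512} = \frac{4096\ \text{Hz}}{512} = 8\ \text{Hz},
\qquad
N = \frac{2400\ \text{Hz}}{\Delta f} = 300
```
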

The list of fourteen words shown in Table 4 is repeated twice by each speaker, which
leads to a set of twenty-eight words per speaker. Each word is modeled using an AR
model of order twenty-four. From the thirty-one native English speakers, sixteen are
selected for an English speaking reference group. The recordings of the remaining fifteen
native English speakers are performance tested against the reference group. The Brazilian
recordings are also performance tested against the native English speaking reference
group. The test group consists of all the native English speakers not in the reference
group and all the non-native English speakers (Brazilians in this study). A reference AR
model for each word is produced by calculating the mean of the sixteen AR
models obtained from the native English speakers in the reference group. Figure 12 shows
the reference AR model, highlighted with asterisks, among the AR models of the selected
reference group of sixteen native English speakers for the word "girl". The reference
model is used as the basis for all of the performance measures except the boundary
measure. The following sections first describe each performance measure, and next
explain how each AR modeled word in the twenty-eight word set is tested against a
reference.
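
The construction of the reference model as the mean of the reference group spectra can be
sketched as follows. This Python fragment is illustrative only; the function name and the
toy spectra are assumptions, and the actual models contain N = 300 points per word.

```python
def reference_model(spectra):
    """Pointwise mean of equal-length spectra, one list per reference speaker."""
    if not spectra:
        raise ValueError("at least one spectrum is required")
    n_points = len(spectra[0])
    if any(len(s) != n_points for s in spectra):
        raise ValueError("all spectra must have the same number of points")
    return [sum(s[i] for s in spectra) / len(spectra) for i in range(n_points)]


# Three toy spectra of four points each (the study uses sixteen speakers).
group = [[1.0, 4.0, 2.0, 1.0],
         [2.0, 6.0, 2.0, 1.0],
         [3.0, 5.0, 2.0, 1.0]]
print(reference_model(group))  # [2.0, 5.0, 2.0, 1.0]
```
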
A. SYMMETRIZED ITAKURA DISTANCE 

The Itakura distance enhances the effects of spectral differences due to the locations of
the AR model peaks [3]. The AR model peaks indicate the formant frequencies present.
The valleys of the AR model are not enhanced; therefore the errors from the differences in
the valleys between the reference model and the tested speaker are not weighted as heavily
as the differences in peaks. The formant frequencies are the frequencies of interest
here in determining the quality of the phoneme pronunciation or whether a foreign accent
is present.

The Itakura distance has been used extensively in speech applications [3]. It is not a
true metric because it lacks the symmetry property. For example, if v(ω) is the
spectral information corresponding to a speaker to be tested, Ref(ω) is some reference
model (obtained as the magnitude mean of the reference group), and Itk(v(ω), Ref(ω)) is
defined as the Itakura distance between them, then:

Itk(v(ω), Ref(ω)) ≠ Itk(Ref(ω), v(ω)).

Figure 12: AR spectra obtained for the word "girl" for sixteen native male English speakers;
resulting mean spectra (reference model) highlighted with asterisks.


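To eliminate the above problem, the symmetrized Itakura distance measure is defined as [8]:

$$Itk(v, Ref) = \ln\left[\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{v(\omega)}{Ref(\omega)}\,d\omega\right)\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{Ref(\omega)}{v(\omega)}\,d\omega\right)\right] \qquad (7)$$

Equation (7) satisfies the symmetry property:

Itk(v(ω), Ref(ω)) = Itk(Ref(ω), v(ω)).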


The symmetrized Itakura distance also has the property that the measure between two
identical AR models is zero, for example:

Itk(Ref, Ref) = 0 and Itk(v, v) = 0. (8)

Therefore, as the symmetrized Itakura distance between two spectra increases, the
similarity between the two spectra decreases.
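The symmetry and zero-distance properties can be checked numerically. The sketch below is a discrete version of the measure, following the form used in Appendix B (the logarithm of the product of the two mean spectral ratios). The function names and the three-point spectra are hypothetical, and the conversion assumes spectra stored in dB as 10 log10 of power.

```python
import math


def db_to_power(spectrum_db):
    """Convert a spectrum in dB (10*log10 of power) to linear power."""
    return [10.0 ** (value / 10.0) for value in spectrum_db]


def symmetrized_itakura(power_a, power_b):
    """Log of the product of the two mean spectral ratios a/b and b/a."""
    if len(power_a) != len(power_b) or not power_a:
        raise ValueError("spectra must be non-empty and of equal length")
    n = len(power_a)
    mean_ab = sum(a / b for a, b in zip(power_a, power_b)) / n
    mean_ba = sum(b / a for a, b in zip(power_a, power_b)) / n
    return math.log(mean_ab * mean_ba)


# Hypothetical three-point spectra in dB, for illustration only.
ref = db_to_power([0.0, 10.0, 20.0])
speaker = db_to_power([3.0, 10.0, 14.0])

d_self = symmetrized_itakura(ref, ref)          # identical spectra
d_forward = symmetrized_itakura(speaker, ref)   # speaker against reference
d_backward = symmetrized_itakura(ref, speaker)  # reference against speaker
```

Swapping the two arguments only swaps the two factors inside the logarithm, which is why the forward and backward values agree.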

1. Application of the symmetrized Itakura distance 
The AR model obtained from a speech signal s(n) is expressed in terms of the AR
coefficients defined in Chapter III as:

$$A(z) = 1 + a_1 z^{-1} + a_2 z^{-2} + \cdots + a_P z^{-P}$$

and the gain b_0. The spectrum S(ω) of s(n) is obtained from the magnitude squared of the frequency
response of the associated transfer function:

$$S(\omega) = \left|\frac{b_0}{A(e^{j\omega})}\right|^2$$

The MATLAB™ implementation of the symmetrized Itakura distance measure is
presented in Appendix B.
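As a rough illustration of how a spectrum follows from the AR coefficients and gain, the sketch below evaluates the magnitude squared of b_0 / A(e^{jω}) on a grid of frequencies in [0, π) and expresses it in dB. The function name `ar_spectrum_db` and the first-order example model are assumptions made for illustration; the thesis code in Appendix A uses MATLAB's `freqz` for this step.

```python
import cmath
import math


def ar_spectrum_db(a, b0, n_points):
    """Return 10*log10(|b0 / A(e^{jw})|^2) at n_points frequencies in [0, pi)."""
    spectrum = []
    for k in range(n_points):
        w = math.pi * k / n_points
        # A(e^{jw}) = a[0] + a[1] e^{-jw} + ... + a[P] e^{-jwP}, with a[0] = 1
        a_of_w = sum(coef * cmath.exp(-1j * w * i) for i, coef in enumerate(a))
        power = abs(b0 / a_of_w) ** 2
        spectrum.append(10.0 * math.log10(power))
    return spectrum


# Hypothetical first-order model A(z) = 1 - 0.5 z^{-1} with unit gain.
spectrum = ar_spectrum_db([1.0, -0.5], 1.0, 8)
```

For this toy model the spectrum peaks at ω = 0, where A = 0.5 and the power is 4 (about 6 dB).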
2. Testing using the Itakura distance 


A reference model tested against itself produces a measure equal to zero, as shown
in equation (8). Every frequency response in the selected reference group is tested against
the reference model (recall that the reference model is the mean of the selected AR spectra
obtained from the reference group). The largest Itakura distance obtained from the
reference model to all models contained in the reference group is labeled Ref_max. Next,
each model contained in the test group is compared to the reference model and the
resulting distance is compared to Ref_max. Every speaker in the test group that has an
Itakura distance measure larger than Ref_max is marked as a failure for the Itakura distance
test corresponding to that word of the twenty-eight word set.
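The pass/fail rule above amounts to a threshold taken from the reference group itself. A minimal sketch, with a hypothetical function name and made-up distances for one word:

```python
def itakura_failures(reference_distances, test_distances):
    """Flag each test distance that exceeds the largest reference-group distance."""
    ref_max = max(reference_distances)
    flags = [distance > ref_max for distance in test_distances]
    return ref_max, flags


# Hypothetical distances for one word (not taken from the thesis data).
reference_distances = [0.12, 0.30, 0.21, 0.18]
test_distances = [0.25, 0.31, 0.30, 0.90]
ref_max, flags = itakura_failures(reference_distances, test_distances)
```

A test distance equal to the threshold is not flagged here, since the text marks a failure only for distances larger than the reference maximum.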


B. CROSS-CORRELATION COEFFICIENT 
Two cross-correlation coefficients are used in this study: the normalized
cross-correlation coefficient, which is referred to as cross-correlation-1, and the modified
normalized cross-correlation coefficient, which is referred to as
cross-correlation-2 [9]. Both cross-correlation measures use the reference model
described in the introduction to this chapter and illustrated using the word "girl" in
Figure 12.
1. Normalized cross-correlation coefficient 
The normalized cross-correlation coefficient cross-correlates the reference
model with the AR models to be tested and normalizes the results:

$$\rho_{ur} = \frac{\sum_{n=0}^{N-1} r(n)\,t(n)}{\sqrt{\sum_{n=0}^{N-1} r^2(n)\,\sum_{n=0}^{N-1} t^2(n)}}$$

where ρ_ur is the normalized cross-correlation coefficient measure, r(n) is the AR spectrum
obtained for the reference model, t(n) is the AR spectrum obtained for one word in the
test group, and N is the number of frequency points considered for the test (for this study
N = 300). Note that ρ_ur has a numerical value between zero and positive one. A
numerical value of one (ρ_ur = 1) means that the two sequences r(n) and t(n) are identical,
while a numerical value of zero (ρ_ur = 0) means that the two sequences r(n) and t(n) have
zero percent correlation. The normalized cross-correlation coefficient measure determines
the percent of correlation between the reference model and any test AR model.
2. Modified normalized cross-correlation coefficient 
The modified normalized cross-correlation coefficient is defined in the same way as the
normalized cross-correlation coefficient measure, except that, before the
cross-correlation is computed, the means of the reference model and of each test AR model are removed.
The range of possible numerical values for the modified normalized cross-correlation
coefficient measure is between negative one and positive one. The case of identical sequences
with a numerical value of one (ρ_rr = 1) still holds. For the case of no correlation, the
numerical value is zero (ρ_rr = 0).
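The difference between the two coefficients can be seen on a small example. The sketch below assumes the usual zero-lag normalization (the sum of products divided by the square root of the product of the two energies); the function names and the three-point sequences are hypothetical.

```python
import math


def normalized_xcorr(r, t):
    """Zero-lag cross-correlation of r and t, normalized by their energies."""
    numerator = sum(x * y for x, y in zip(r, t))
    denominator = math.sqrt(sum(x * x for x in r) * sum(y * y for y in t))
    return numerator / denominator  # undefined if either sequence is all zeros


def modified_normalized_xcorr(r, t):
    """Same coefficient after removing the mean of each sequence."""
    mean_r = sum(r) / len(r)
    mean_t = sum(t) / len(t)
    return normalized_xcorr([x - mean_r for x in r], [y - mean_t for y in t])


rising = [1.0, 2.0, 3.0]
falling = [3.0, 2.0, 1.0]

rho_ur = normalized_xcorr(rising, falling)           # mean (DC) included
rho_rr = modified_normalized_xcorr(rising, falling)  # mean removed
```

With the mean included, the two opposite-sloped sequences still correlate positively because both are dominated by their common offset; once the means are removed the coefficient drops to negative one.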
3. Application of the cross-correlation coefficients 
Cross-correlation-1 and cross-correlation-2 are implemented using the same
procedure. For every word in the word list, each AR model is tested against a reference
model using both cross-correlation-1 and cross-correlation-2. The selected reference
group is screened to determine the minimum numerical value for both cross-correlation
tests. The minimum value from the reference group, ρ_min, is compared against each
numerical value of ρ calculated from the AR models in the test group. The MATLAB™
implementation of the cross-correlation coefficients is presented in
Appendix C.
4. Testing using cross-correlation coefficients 

For every speaker in the test group, the magnitudes of the cross-correlation
measures are compared against the minimum value |ρ_min| of the reference group. A
speaker in the test group receives a failure each time a recorded word has a
cross-correlation magnitude less than |ρ_min|. A total of four failures may be
received for the cross-correlation tests for a single word, since each word is recorded twice
and both cross-correlation measures are used.
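The count of up to four failures per word can be sketched as below. The dictionary layout, the function name, and the numerical values are assumptions made for illustration; only the comparison against the reference-group minimum comes from the text.

```python
def xcorr_word_failures(test_values, reference_minimums):
    """Count recordings whose coefficient magnitude falls below the reference minimum.

    test_values maps a measure name to the coefficients of the two recordings;
    reference_minimums maps the same name to the minimum found in the reference group.
    """
    failures = 0
    for measure, recordings in test_values.items():
        threshold = abs(reference_minimums[measure])
        failures += sum(1 for value in recordings if abs(value) < threshold)
    return failures


# Hypothetical values for one speaker and one word.
reference_minimums = {"UR": 0.90, "RR": 0.60}
speaker_word = {"UR": [0.95, 0.88], "RR": [0.55, 0.40]}
failures = xcorr_word_failures(speaker_word, reference_minimums)
```

Here one UR recording and both RR recordings fall below their thresholds, so the word contributes three of the four possible failures.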


C. LOG SPECTRAL DISTANCE 

The log spectral distance uses the reference model described in the introduction to this
chapter and shown using the word "girl" for a selected reference group in Figure 12. The
log spectral distance computes the sum of the absolute differences between the frequency
components of the AR spectrum, expressed in dB, obtained for the reference model and
for any of the components in the test group. The resulting log spectral distance expression is
given by:

$$CB = \sum_{n=1}^{N} \left|\log(AR_{R_n}) - \log(AR_{T_n})\right|,$$

where CB is the log spectral distance, AR_{R_n} is the spectrum value at frequency location n
for the reference model, AR_{T_n} is the spectrum value at frequency location n for a
component of the test group, and the parameter N is the number of frequency points
considered in this study.
1. Application of the log spectral distance 
The log spectral distance is used to test each AR model in the test group against 
the reference model for each word recorded. The MATLAB™ implementation of the log 
spectral distance is presented in Appendix D. 
2. Testing with the log spectral distance 
Every AR modeled word is tested against the reference model, including every
word in the reference group. The maximum log spectral distance calculated for the
reference group serves as the threshold against which each log spectral distance from the test
group is compared. For each log spectral distance from the test group that is greater than the
maximum log spectral distance obtained for the reference group, a failure is marked for
that speaker. Each speaker may fail the log spectral distance test twice for each word on the
selected word list (Table 4), since each word is recorded twice.
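A small sketch of the distance and of the failure rule follows. It assumes the spectra are already expressed in dB, so that the difference of logarithms becomes a difference of dB values; the function names and the toy spectra are hypothetical.

```python
def log_spectral_distance(reference_db, test_db):
    """Sum of absolute differences between two spectra given in dB."""
    if len(reference_db) != len(test_db):
        raise ValueError("spectra must have the same length")
    return sum(abs(r - t) for r, t in zip(reference_db, test_db))


def log_spectral_failures(reference_model, reference_group, recordings):
    """Count recordings whose distance exceeds the reference-group maximum."""
    max_ref = max(log_spectral_distance(reference_model, s) for s in reference_group)
    return sum(
        1 for s in recordings
        if log_spectral_distance(reference_model, s) > max_ref
    )


# Hypothetical three-point spectra in dB.
model = [0.0, 10.0, 20.0]
group = [[1.0, 9.0, 20.0], [0.0, 12.0, 19.0]]          # distances 2 and 3
two_recordings = [[0.0, 10.0, 24.0], [1.0, 10.0, 21.0]]  # distances 4 and 2
word_failures = log_spectral_failures(model, group, two_recordings)
```

Only the first recording exceeds the reference-group maximum of 3, so this word contributes one of the two possible failures.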


D. "BOUNDS" MEASURE 


The "bounds" measure is used to identify differences in AR model frequency locations
or AR model shapes indicating different sounds. The reference model described in the
introduction of this chapter is not used for the "bounds" measure. The AR model
spectrum magnitude upper and lower values are the reference bounds for this measure.
The reference upper bound is obtained by taking the maximum magnitude for each 
frequency component of the AR spectra from the native English speaking reference group. 
Similarly, the reference lower bound is obtained by taking the minimum magnitude for 
each frequency component of the AR spectra from the native English speaking reference 
group. Figure 13 shows a selected native English speaking reference group of AR models 
for the word "girl" with the bounds highlighted with asterisks. 

1. Application of the "bounds" measure 

The reference bounds are computed for each word on the word list (Table 4), and
then each AR model of the native English and non-native English speaking test groups is
tested against the bounds. The MATLAB™ implementation of the "bounds" measure is
presented in Appendix E.

2. Testing using the "bounds" measure 

The AR modeled words are tested by determining the percentage of each AR
modeled word from the test group that is outside the bounds. The speaker is marked with
a failure for the "bounds" measure for a word when, for both recordings of that
word, five percent of the magnitude of the frequency response of the AR model is
located outside the reference bounds. Experimentally, a limit of five percent of the magnitude of the
frequency response of the AR model outside the reference bounds proved to be
satisfactory for the list of words considered. Each speaker may receive only one failure of
the "bounds" measure for each word on the selected word list (Table 4).
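The "bounds" test can be sketched as below. The sketch interprets "five percent of the frequency response" as the fraction of frequency points lying outside the bounds and treats a fraction of at least five percent as exceeding the limit; both are assumptions, as are the function names and the toy spectra.

```python
def reference_bounds(spectra):
    """Per-frequency minimum and maximum over the reference-group spectra."""
    n_points = len(spectra[0])
    lower = [min(s[k] for s in spectra) for k in range(n_points)]
    upper = [max(s[k] for s in spectra) for k in range(n_points)]
    return lower, upper


def fraction_outside(spectrum, lower, upper):
    """Fraction of frequency points lying below the lower or above the upper bound."""
    outside = sum(
        1 for value, lo, hi in zip(spectrum, lower, upper)
        if value < lo or value > hi
    )
    return outside / len(spectrum)


def bounds_failure(recording_1, recording_2, lower, upper, limit=0.05):
    """A word fails only if both recordings reach the limit (assumed: >= limit)."""
    return (fraction_outside(recording_1, lower, upper) >= limit
            and fraction_outside(recording_2, lower, upper) >= limit)


# Hypothetical four-point spectra in dB.
lower, upper = reference_bounds([[0.0, 10.0, 20.0, 30.0], [2.0, 8.0, 22.0, 28.0]])
inside = [1.0, 9.0, 21.0, 29.0]
partly_outside = [1.0, 9.0, 21.0, 35.0]
```

Requiring both recordings to exceed the limit is what restricts each speaker to a single "bounds" failure per word.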
Figure 13: AR spectra obtained for the word "girl" for sixteen native male English speakers;
resulting reference "bounds" highlighted with asterisks.

VII. MODELS AND TEST RESULTS


The goal of performance testing is to ensure results are achieved from a robust set of 
tests with a detailed method of analysis. The five performance measures used here to 
determine whether a given speaker is or is not a native American English speaker have 
been described earlier in Chapter VI. The results of the five performance measures are 
combined to determine if the tested speaker is a native English speaker, a non-native 
(foreign) English speaker or marginal. A marginal rating means that the speaker could be 
a native or a non-native English speaker. A non-native English speaker with a minor 
accent could potentially pass as a native English speaker and conversely, a native English 
speaker with an anomaly in his phoneme production or with a regional native English 
accent not sufficiently contained in the reference group could fail as a native English 


speaker. 
A. REFERENCE MODELS 


Reference models are used to determine how well speakers in the test group 
pronounce phonemes contained in full words. Recall that the reference model is obtained 
from a selected reference group of sixteen speakers from the thirty-one native English 
speakers recorded. To eliminate potential bias in the results, six reference groups are 
selected and used to obtain a reference model and reference bounds for the "bounds" 
measure. Each reference model and reference bound is then used to test the associated 
test group. Recall that the test group consists of the native English speakers not included
in the reference group and the non-native (Brazilian) English speakers.

The native English speakers are numbered in the order they are recorded. The
numbering scheme for the native English speakers is S1, S2, ..., S31. The Brazilian
speakers are labeled similarly, where B1 is the first Brazilian non-native English speaker
recorded and B6 is the last Brazilian non-native English speaker recorded (B1, B2, ...,
B6). The first reference group used in this study, labeled RG1 for simplicity, consists of
the first sixteen native English speakers recorded, i.e., RG1 = (S1, S2, ..., S16). The
next fifteen native English speakers recorded, as well as the Brazilians recorded, make up
the test group, labeled TG1 = (S17, S18, ..., S31, B1, B2, ..., B6). The second
reference group consists of the last sixteen native English speakers with respect to the
order in which they were recorded, RG2 = (S16, S17, ..., S31), and the associated test
group is TG2 = (S1, S2, ..., S15, B1, B2, ..., B6). The third reference group consists of
all the odd numbered native English speakers with respect to the order in which they were
recorded, RG3 = (S1, S3, ..., S31), and the associated test group is TG3 = (S2, S4, ...,
S30, B1, B2, ..., B6). Additional reference and test groups are selected randomly and
they include:

- RG4 = (S1, S2, S3, S8, S12, S13, S16, S18, S20, S21, S22, S23, S24, S26, S28, S29),
with TG4 = (S4, S5, S6, S7, S9, S10, S11, S14, S15, S17, S19, S25, S27, S30, S31, B1,
B2, ..., B6);

- RG5 = (S4, S8, S9, S10, S13, S14, S16, S17, S18, S21, S25, S26, S27, S28, S29,
S30), with TG5 = (S1, S2, S3, S5, S6, S7, S11, S12, S15, S19, S20, S22, S23, S24, S31,
B1, B2, ..., B6);

- RG6 = (S1, S2, S3, S4, S5, S6, S7, S12, S14, S19, S21, S23, S25, S26, S29, S30),
with TG6 = (S8, S9, S10, S11, S13, S15, S16, S17, S18, S20, S22, S24, S27, S28, S31,
B1, B2, ..., B6).
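The relationship between each reference group and its test group (every native speaker not in the reference group, plus all six Brazilian speakers) can be checked with a short sketch. The helper name `build_test_group` is hypothetical; the speaker labels and group memberships are those listed above.

```python
NATIVES = [f"S{i}" for i in range(1, 32)]      # S1 ... S31
BRAZILIANS = [f"B{i}" for i in range(1, 7)]    # B1 ... B6


def build_test_group(reference_group):
    """Natives not in the reference group, followed by all Brazilian speakers."""
    members = set(reference_group)
    return [s for s in NATIVES if s not in members] + BRAZILIANS


rg1 = NATIVES[:16]       # first sixteen recorded
rg2 = NATIVES[-16:]      # last sixteen recorded
rg3 = NATIVES[0::2]      # odd-numbered speakers
rg4 = [f"S{i}" for i in (1, 2, 3, 8, 12, 13, 16, 18, 20, 21, 22, 23, 24, 26, 28, 29)]

groups = {
    name: build_test_group(rg)
    for name, rg in {"TG1": rg1, "TG2": rg2, "TG3": rg3, "TG4": rg4}.items()
}
```

Each reference group holds sixteen speakers, so each test group holds fifteen native and six Brazilian speakers, twenty-one in total.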

The native English speakers are from many areas of the United States. The states
included are: California, Connecticut, Florida, Kansas, Maryland, Massachusetts,
Minnesota, Mississippi, Missouri, New Jersey, New York, North Dakota, Ohio, Oregon,
South Carolina, Virginia, and Wisconsin.
B. THRESHOLDS 


Chapter VI defines individual word failures for each performance measure. Recall that
all of the performance measures except the "bounds" measure record failures for each
word individually and, since each word on the word list in Table 4 is recorded twice, a set
of twenty-eight words are potential failures. The "bounds" measure uses a method which
combines the results obtained from both recordings of a given word, leaving a potential of
fourteen word failures. Next, thresholds are required to determine how many of these
word failures constitute a failure for each performance measure. The performance
measure thresholds are related to the number of words on the word list in Table 4.
Thresholds are also required to determine the number of performance measure failures
that establish a rating of each speaker as native English speaker, non-native English
speaker, or marginal. The rating thresholds are not dependent on the number of words on
the word list in Table 4. Note that all thresholds are obtained heuristically through
experimentation. Ideally, the results should show that all native English speakers tested
against any of the native English speaker reference groups are determined to be native
English speakers, and that all non-native English speakers tested against any native English
speaking reference group are determined to be non-native English speakers. Table 5
shows the thresholds set for the number of word failures that are required for each
performance measure to be considered a failure.


TABLE 5 THRESHOLDS FOR PERFORMANCE MEASURE FAILURE USING THE FOURTEEN-WORD LIST


The test names are abbreviated as follows: ITK is the symmetrized Itakura distance, UR is
the normalized cross-correlation coefficient, RR is the modified normalized
cross-correlation coefficient, CB is the log spectral distance, and BND is the "bounds"
measure.

Rating thresholds are determined experimentally and are not related to the number of 
words on the word list in Table 4. The thresholds for the number of performance measure 
failures that establish a rating of either native English speaker, marginal, or non-native 
English speaker, are listed in Table 6. 


TABLE 6 THRESHOLDS FOR RATINGS 








The number of performance measure failures is arrived at by counting the number of
failures that meet the thresholds established in Table 5, and adding to that number 0 if a 0
is recorded for any one performance measure, or 2 if no zero is recorded
for any performance measure. This offset was chosen to reward a
perfect score on any one performance measure. The results of this study (using all
reference groups, with the reduced word list explained later in this chapter) show that
89% of all native English speakers score 0 for one of the five performance measures when
tested. Table 7 shows an example of how the ratings are calculated. The columns in
Table 7 labeled ITK, UR, RR, CB, and BND contain the number of failures of each
performance measure using the rules for failure established in Chapter VI. The number of
performance measure failures established by the thresholds listed in Table 5 is counted
and recorded in the column labeled PMF (Performance Measure Failures). The column
labeled ZS (Zero Scored) reflects whether a zero is recorded for any performance
measure (zero for a zero recorded and two for no zero recorded). The column labeled
Total is the sum of the two columns labeled PMF and ZS. The column labeled Rating is
scored by comparing the number listed in the Total column with the thresholds listed
in Table 6. A speaker is given a rating of N for native English speaker, F for non-native
(Foreign) English speaker, or M for marginal.
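The PMF, ZS, Total, and Rating columns can be computed as in the sketch below. The per-measure thresholds and the rating cut-offs used here are hypothetical placeholders, not the values of Tables 5 and 6; the function name and the two example speakers are also made up for illustration.

```python
MEASURES = ("ITK", "UR", "RR", "CB", "BND")


def rate_speaker(word_failures, measure_thresholds, native_max, foreign_min):
    """Return (PMF, ZS, Total, Rating) for one speaker.

    word_failures and measure_thresholds are dicts keyed by measure name.
    """
    # PMF: number of measures whose word-failure count meets its threshold.
    pmf = sum(1 for m in MEASURES if word_failures[m] >= measure_thresholds[m])
    # ZS: 0 if any measure scored zero failures, otherwise 2.
    zs = 0 if any(word_failures[m] == 0 for m in MEASURES) else 2
    total = pmf + zs
    if total <= native_max:
        rating = "N"
    elif total >= foreign_min:
        rating = "F"
    else:
        rating = "M"
    return pmf, zs, total, rating


# Hypothetical thresholds: placeholders, not the values of Tables 5 and 6.
thresholds = {"ITK": 3, "UR": 3, "RR": 3, "CB": 3, "BND": 2}
clean = rate_speaker({"ITK": 0, "UR": 1, "RR": 0, "CB": 3, "BND": 0}, thresholds, 2, 5)
accented = rate_speaker({"ITK": 4, "UR": 5, "RR": 3, "CB": 6, "BND": 2}, thresholds, 2, 5)
```

The offset of 2 means that a speaker with no perfect score on any measure starts with a Total of at least 2 even before any threshold is met.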
TABLE 7 EXAMPLE RATING CALCULATIONS
Speaker 








C. TEST RESULTS 


The test results shown in the following tables have the same format as the example
shown in Table 7. The results for test group 1 (TG1) with reference group 1 (RG1) are
presented in Table 8. The results from TG1 show that 100% of the non-native English
speakers tested received a rating of foreign (none received a rating of marginal or
native), 73% of the native English speakers tested
received a rating of native, 20% of the native English speakers received a rating of foreign,
and one native English speaker, corresponding to 7%, received a rating of marginal. The
results for the other five test groups are calculated in the same way as for the example in
Table 7 and for the results shown in Table 8.
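The percentages quoted for TG1 follow directly from the rating counts. In the sketch below the counts of eleven native and three foreign ratings among the fifteen native speakers are inferred from the stated 73% and 20% (one marginal rating is stated explicitly); the function name is hypothetical.

```python
from collections import Counter


def rating_percentages(ratings):
    """Percentage of speakers receiving each rating (N, M, F)."""
    counts = Counter(ratings)
    n_speakers = len(ratings)
    return {label: 100.0 * counts.get(label, 0) / n_speakers for label in ("N", "M", "F")}


# Fifteen native speakers in TG1: counts inferred from the quoted percentages.
native_tg1 = ["N"] * 11 + ["F"] * 3 + ["M"]
native_pct = rating_percentages(native_tg1)

# Six Brazilian speakers in TG1, all rated foreign.
foreign_pct = rating_percentages(["F"] * 6)
```

Rounded to whole percentages, the native speakers give 73% native, 7% marginal, and 20% foreign, which matches the figures in the text.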
TABLE 8 RESULTS FOR TG1 WITH RG1 (FOURTEEN-WORD LIST)
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


si7_ ft ofofofot2z}ofofo} nN | 
sig_ of 2t2}3 To} 3}ofofot Nn | 
si9_ Jofofofofofofofofn | 
soo fotofotototoftotot ns 
sz1 Pap 2p zt pop ep NN 
s22_ pots totTiptitiftoftiy Nn | 
sa 2 tits pop sti totaly Nn 
s24 Pt tt Pop tN 
sos} 3 foftotay2ztitotit Nn 
s26_ Pt fof opi ty 2totopfot Nn 
so7_ fo fofofTotiftofofot n | 
sg tte tat to | 
so PoP apa tap at spat 
30 Pups ztspots|2y7t F 


sit stot2;4i7}3fo [31m | 

ee) ee ee Ee Ge Re 
BE spat ts tte el | 
B2 op 4 pat spats ts [2 t7 fF 
B32 ee sp ttl 
Ba tz tot ttt te Pst zt 


BS Psst ts Tz tat zo fF 
Bo J 2e 7 4t2tstst2i sf] 


Table 9 summarizes the test results obtained for all combinations considered. The
abbreviated headings for each column in Table 9 are: PNRN (Percentage of Native
speakers Rated as Native speakers), PFRF (Percentage of Foreign speakers Rated as
Foreign speakers), PNRM (Percentage of Native speakers Rated as Marginal), PFRM
(Percentage of Foreign speakers Rated as Marginal), PNRF (Percentage of Native
speakers Rated as Foreign speakers), PFRN (Percentage of Foreign speakers Rated as
Native speakers), and STD (STandard Deviation).


TABLE 9 SUMMARY OF TEST RESULTS FOR FOURTEEN-WORD LIST

Test Group | PNRN | PFRF | PNRM | PFRM | PNRF | PFRN
Mean | 78 % | 74 % | 2 % | 11 % | 20 % | 15 %
1 STD | 16 % | 14 % | 3 % | 12 % | 14 % | 12 %


Results shown in Table 9 for all tests using six different reference groups indicate high
levels of misclassification. Thus, the word list shown in Table 4 must be restricted to the
words which are considered by the non-native English speakers as the most difficult ones to
pronounce.

The fourteen-word list from Table 4 is reduced to five words: "man", "zap", "girl",
"seeing", and "word". Recall that the performance measure failure thresholds are
dependent on the number of words contained on the word list, therefore the thresholds for 
the reduced word list are also reduced. Table 10 shows the performance measure failure 
thresholds for the reduced five-word list. 


TABLE 10 THRESHOLDS FOR PERFORMANCE MEASURE FAILURE (5 WORD) 








The thresholds that determine the rating of a speaker remain constant and are listed in 
Table 6. 

Table 11 through Table 16 show the results for TG1 through TG6 using the reduced
five-word list. Table 11 contains the results for TG1 with RG1.


TABLE 11 RESULTS FOR TG1 WITH RG1 USING THE FIVE-WORD LIST
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


si7_ of ofotofot2tofojfofn | 
sis fop2tifto{3si2to}t2}n | 
si9_ TotTofTototoftoftofo} nN 


soo fofoftototototofo} ns 
suf ofa tit itotototo} ns 
s22_ fof itotiftiftoftojfot nx) 
s3  fofotifotoftotofo}t Nn) 


sq fofofotztitifoftilt n | 
sf uf ofotiftoftotoj}o} n | 
so fofoptotifijojofo}] ny 
s7_ ftototoftotifototo} ns 
Tr Eee eee 
Pre EE En ES ERR 
s3o_ [3 t+ pa tit stat zie [| 
ssi piftotot2ztst2ztot2i n | 


Bp fifitif3{i1{i[2/3]™_ 
B2 as sas ttt 7 
Bef 2p 2 2t ass Pt 
B4 toto fot spats t2t7] Ff 
BS ot 2t2;2;a72pat2teo ft FL 
Bo Pz Paps te2ty2tse tz te TF 







Table 12 contains the results for TG2 with RG2, Table 13 contains the results for TG3
with RG3, Table 14 contains the results for TG4 with RG4, Table 15 contains the results
for TG5 with RG5, and Table 16 contains the results for TG6 with RG6.


TABLE 12 RESULTS FOR TG2 WITH RG2 USING THE FIVE-WORD LIST
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


sip of oti ftoftofoftot N | 
sz i foptoptot2ztotofot Nn 


3 fo fotototijfototot nN | 
sa Pt tof zp o fet NN 
sso ot 4totifot2zpifopil Nn | 
sof ofojpyototijojojojy nN 
s7_ fof i tofot2ztofojtoj n | 
sof ofofoftotifofoftot nN 
so titi fototofopfofot nN | 
sio Jopifofotitotofo} n | 
suf af fofotztitoptit nN 
siz J ujpopfoptot2ztotfofotn 
si3_ ti Po fof ofoftototot N | 
si4 Ps Pe Pt ts 
sis Jopfofofoftotoftojo] n | 

SE MG GE Ge YO OE Ee 
BL Poti ti tz toftotot NS 
7 RAE ERE EEA 
Bs tz lo 








TABLE 13 RESULTS FOR TG3 WITH RG3 USING THE FIVE-WORD LIST
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


sz pop ofop zy iyi potit N 


so frl2tztololztol2 [nN 
so fofolofolfofolo|o]n_ 
ss Tololofoli}olo|fo] nv 
so fofilolil2z}tolofol| Nn 
sz fpofolol2t2tifoli [nN 
sa filzletile lolz 
so Toftotolit[o{[ololo| Nn 
sia fof3atololitlifoli| Nn, 
20 fofololfololfolo}ol|n_ 
sxx fpolilolitifolo]ol]Nx_ 
sa fofolfo}fololololo] nN 
6 fololTolilijfolo|[o| nx. 
se potilololilolo|ol Nn 
so J2teilsiselotsizt7l 
B Irtititlalelolzt2tn | 
pe Iolelelsiolsl2tl7 
Bp l2lz2l2lolil«letol 
Be telstsistl«lsil2t71 
B l2lz2l2le2le2lelolel - 
Bfol2tztiiii2ztol2t nt 














TABLE 14 RESULTS FOR TG4 WITH RG4 USING THE FIVE-WORD LIST
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


S4
S5
S6
S7
S9
S10
S11
S14
S15
S17
S19
S25
S27
S30
S31


B1
B2
B3
B4
B5
B6





poftoftoftovftovtot{ ni 


Be 


L2t4,2jpe} F 
poftoftofotn 


we 
8 
~ 








TABLE 15 RESULTS FOR TG5 WITH RG5 USING THE FIVE-WORD LIST
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


sup to fo Pz ti po fat NN 
sz fof otopotifojotot Nn 
sx ff ofofoftofofoftoftot n | 
sso of oftoftototityofojfyot nN 
sof ofoftotototofofoj} nN 
s7_fofofofotitoftotot n| 
su so tt flo fl tM 
siz_ fofofofo}2tojfofo] n | 
sis_ sf ofotoftoftojojofo} nN 
sion Tojvoto}toftofofofo] nN 
soo J ojfototototojfoj}otn 
sz2_foftofofoftofofofo{n | 
s3_ of ofotototojtoftofo}] Nn 
sq fTofototTifyoftoftofot nN 
su ti foftotsatsat2ztot2t no 


a t2{i[:[;|]2]2]21-| + 
Yi REE ERE 
BB zt | 
Bats tastes ttl | 
Te Ee Pe Pe ee eee 
Bo tititiftoft2z{tofojfo] Nn | 














TABLE 16 RESULTS FOR TG6 WITH RG6 USING THE FIVE-WORD LIST 
Speaker | ITK | UR | RR | CB | BND | PMF | ZS | Total | Rating


ss fofofeofolofofofo[n 
so fofoltofofofololo[n_ 
so fofo}of2t2tifoli[ Nn 
si_t3ti}1J2}2{2}2[«l[e_ 
six fofofofo{i}fofolo[n_ 
sis_fofofofolofofofo]n_ 
sis_fofo}ofifofofofo|] n_ 
7 _fofofofoli{o}fofo] n— 
sis_fofofofofolofotol n_ 
soo fofo}ofofofofofo| n_ 
sz tofifofoloftofofo] x 
soa tofofoti{i}o. 

so7 _fotofofotolo 
so fofofofoli}o 
si_fofofof2|2{i}o. 


aaa 
Bip fot ts at zt ope 
B2 7 
B32 
Ba Ss | s Js | 
BS 
BoP 2 r pit 233 





Table 17 summarizes all six test group results using the reduced five-word list, and this 
recapitulation shows more convincing results. The abbreviations for the column headings 
in Table 17 are the same as for Table 9 (p. 49). 


TABLE 17 SUMMARY OF TEST RESULTS FOR THE FIVE-WORD LIST

Test Group | PNRN | PFRF | PNRM | PFRM | PNRF | PFRN
Mean | 88 % | 80.5 % |   % | 2.8 % | 10 % | 16.7 %
1 STD | 7 % | 11 % | 3 % | 6 % | 6 % | 13 %


Comparing results from this reduced word list with those listed in Table 9 (p. 49),
obtained when using the fourteen-word list, shows that a distinct improvement in
classification performance has been obtained; however, PFRN shows a small degradation.
This comparison also shows that the selection of the word list is one of the key factors in
the automatic classification of native versus non-native speakers.


The MATLAB™ implementation of the results is presented in Appendix F. 













VIII. CONCLUSIONS


The goal of accent recognition investigated in this thesis is to automatically detect 
non-native (foreign) English speakers as foreign, and native American English speakers as 
native using AR modeling. The processing techniques are simple to implement and data 
preparation is automated. The entire process from spoken word to rating of speaker can 
be automated for practical use. 

This thesis considers the use of a few single syllable words common in daily
speech, and focuses on one group of non-native English speakers, with the notion that the
techniques used for accent detection may be extended to recognize non-native English
speakers from many languages. The non-native English speakers selected for this study
are all Brazilian students attending the Naval Postgraduate School. The word list used is
made up of words that are difficult for native Brazilians to pronounce. The native English
speakers used in this study are originally from various regions of the United States and are
all military servicemen, which limits regional accent due to the many areas of their travels
and residences.


Results show that an average of 88 % of all native speakers tested are rated as native,
and that an average of 80.5 % of all foreign speakers tested are rated as foreign. Six
different reference groups of sixteen native English speakers are separately used to test
fifteen native and six non-native English speakers. The robustness of the techniques is
supported by the fact that the various reference groups produce similar
results.

The results produced by this study are encouraging as they show that it may be
possible to detect foreign accents. However, these results may be improved by choosing
better words for the performance tests, maintaining a cleaner environment for recording,
and adding a time varying analysis technique to the performance measures. First, choosing
better words for the performance test could improve results, as word selection is critical to
achieving accent recognition. Words may exist that have more consistency among native
English speakers and cause more variance from the reference groups for foreign speakers,
which would produce better results overall. Second, maintaining a cleaner environment
for recording may provide higher accuracy for the AR models and emphasize the
differences between native and non-native English speakers, which would improve the
classification process. Finally, adding a time varying analysis technique to the
performance measures may enhance the results by better showing the differences in the
pronunciation of long vowels (diphthongs). The difficulties encountered in these
procedures come in the form of relating different speakers pronouncing the same word
over different durations of time; additional processing such as Dynamic Time
Warping is then needed to align the spoken words. Another alternative would be to
compare spectrograms (three dimensional spectra), and to compare the time sequencing of
the frequencies present. One of the phenomena discovered during this research is that
the Brazilian speakers involved in the study pronounce diphthongs in a time increment that
does not support the sounds required. For example, they pronounce long vowel sounds
too fast. However, it is difficult to match a native English speaker who speaks quickly
with a Brazilian who mispronounces sounds by pronouncing them over too short a period
of time.

An additional approach involving Cepstral analysis was investigated. However, we
noted that it did not produce satisfying results for the tests designed for the AR models.
APPENDIX A MATLAB™ IMPLEMENTATION OF AR SPECTRA


% calcul.m, Calculates the AR spectra and stores them in matrix form
% calcul.m calls function arcorfmp.m (MATLAB™ implementation follows cal.m)

% Inputs
% g is the digitized word sequence to be AR modeled
% N is the length of the AR sequence output
% fs is the sampling frequency of the digitization
% P is the order of the AR model
% E is the number of speakers to be modeled
% [g]e# is the digitized word sequence for native English speaker #
% [g]e#a is the digitized word sequence for native English speaker # second recording of
% the same word
% [g]# is the digitized word sequence for non-native English speaker #
% [g]#a is the digitized word sequence for non-native English speaker # second
% recording of the same word

g = input('Enter the name of the word to be modeled:','s');
N = 512;
fs = 8192;
P = 24;
E = 31;


% Calculate the AR spectra for each native English speaker
for m = 1:E
    nn = num2str(m);
    [MHz] = arcorfmp(eval([g,'e',nn]));
    MHze(:,m) = MHz;
end
% Calculate the AR spectra for the second recording of the same word
for ma = 1:E
    nna = num2str(ma);
    [MHz] = arcorfmp(eval([g,'e',nna,'a']));
    MHzea(:,ma) = MHz;
end
% Create matrix of AR models for English speakers
AA = [MHze,MHzea];


% Clear variables
for m = 1:E
    mm = num2str(m);
    clear (['beinge',mm]); clear (['fifthe',mm]); clear (['girle',mm]); clear (['mane',mm]);
    clear (['mene',mm]); clear (['sate',mm]); clear (['seeinge',mm]); clear (['sete',mm]);
    clear (['site',mm]); clear (['threee',mm]); clear (['treee',mm]); clear (['worde',mm]);
    clear (['worlde',mm]); clear (['zape',mm]); clear (['beinge',mm,'a']);
    clear (['fifthe',mm,'a']); clear (['girle',mm,'a']); clear (['zape',mm,'a']);
    clear (['mane',mm,'a']); clear (['mene',mm,'a']); clear (['sate',mm,'a']);
    clear (['seeinge',mm,'a']); clear (['sete',mm,'a']); clear (['site',mm,'a']);
    clear (['threee',mm,'a']); clear (['treee',mm,'a']); clear (['worde',mm,'a']);
    clear (['worlde',mm,'a']);
end; clear m mm E fs P N clne





% Save each word's AR spectra in a matrix AA
save (['AA',g])

%-------------------------------------------------------------%
% Called function
%-------------------------------------------------------------%


% arcf.m (function), Calculates the AR spectra from the digitized recordings 
% AR model using the autocorrelation method and ar_corr.m from the Naval 
% Postgraduate School SPC toolbox [11] 

% Inputs 

% data is the digitized recordings of each word separately 

% P is the order of the AR model desired 

% N is the length of the frequency response desired 


function [MHz,xax,bo,a,data] = arcorfmp(data); 
P = 24; N= 512; 


% Normalize the data 

datamm = data - mean(data); 

load B100, 

fdatal = filter(B 100, 1,datamm),; 

load Ipf; 

fdata = filter(Bb, Aa,fdata 1); 

fdatamm = fdata - mean(fdata); 

fdatammn = fdatamm ./(sqrt(fdatamm'*fdatamm)); 
data = fdatammn; 


% Calculate the AR model coefficients and gain 
{a,bo,s,R] = ar_corr(data,P); 


% Calculate the frequency response of the AR coefficients with gain bo 
Hz = freqz(bo,a,N); 

% Calculate the power of the frequency response in dB 

MHz = 20*log10(abs(Hz)); 
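
The autocorrelation method named in arcf.m can be illustrated without the SPC toolbox. The following is a minimal sketch, not the thesis code. It assumes a synthetic two-pole test signal (rho, w0, and a_true are hypothetical), stands in for ar_corr.m with a direct solution of the Yule-Walker normal equations, and stands in for freqz with a zero-padded FFT. The mean removal and filtering steps of arcf.m are omitted.

```matlab
% Sketch: AR(P) spectrum by the autocorrelation (Yule-Walker) method.
P = 2; N = 512;

% Hypothetical test signal: impulse response of a known all-pole filter
rho = 0.9; w0 = 0.3*pi;
a_true = [1; -2*rho*cos(w0); rho^2];
x = filter(1, a_true, [1; zeros(255,1)]);
x = x ./ sqrt(x'*x);                  % unit-energy normalization, as in arcf.m

% Biased autocorrelation estimates r(0), ..., r(P)
L = length(x);
r = zeros(P+1,1);
for k = 0:P
    r(k+1) = (x(1:L-k)' * x(1+k:L)) / L;
end

% Solve the normal equations R*c = -[r(1); ...; r(P)] for the AR coefficients
R = toeplitz(r(1:P));
c = -(R \ r(2:P+1));
a = [1; c];                           % A(z) = 1 + c(1) z^-1 + ... + c(P) z^-P
b0 = sqrt(r(1) + r(2:P+1)'*c);        % gain from the prediction error power

% Frequency response on N points over [0, pi) and AR spectrum in dB
A = fft(a, 2*N);
H = b0 ./ A(1:N);
MHz = 20*log10(abs(H));
```

Because the test signal is the impulse response of a known all-pole filter, the estimated coefficients a should agree closely with a_true, and the peak of MHz should fall near the pole angle w0. This gives a simple check of the procedure before it is applied to recorded speech.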
APPENDIX B MATLAB™ IMPLEMENTATION OF THE ITAKURA DISTANCE


% Itk.m, Itakura distance
% This program calculates the Itakura distance for a matrix of input AR spectra TG with
% respect to the reference model reff


% Inputs

% AA is the matrix of AR spectra from the reference group

% TG is the matrix of AR spectra from the test group and the reference model
% N is the length of the AR spectra considered

% nS is the number of speakers in the test group


% Calculate the reference model 
reff = mean(AA');


% Calculate the frequency response from the AR spectra for the reference model
Sr = 1 ./(10 .^(reff ./10));
Sr = Sr(:);


% Calculate the Itakura distance 

%% Test group and reference model 

for u = 1:2*nS+1, 

S = 1 ./(10 .^(TG(:,u) ./10));

dSdSr(u) = log(sum(Sr ./ S)) + log(sum(S ./ Sr)) + log(1/N/N); 


end 
%% Reference group 
for ue = 1:32 


Srr = 1 ./(10 .^(AA(:,ue) ./10));
dSrdSr(ue) = log(sum(Sr ./ Srr)) + log(sum(Srr ./ Sr)) + log(1/N/N); 
end 


% Check: measure the reference model against itself; dchk should equal zero


Schk = (1 ./(10 .^(reff ./10)))';
dchk = log(sum(Sr ./ Schk)) + log(sum(Schk ./ Sr)) + log(1/N/N); 
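
The expression used in Itk.m can be read as log(mean(Sr ./ S)) + log(mean(S ./ Sr)), since the term log(1/N/N) divides each sum by N. The following is a minimal sketch of that expression on two made-up spectra in dB (ref_dB and test_dB are hypothetical and are not thesis data). It illustrates three properties that follow from the formula: the distance of a spectrum to itself is zero, the distance is positive when the spectral shapes differ, and a constant gain offset in dB leaves the distance unchanged.

```matlab
% Sketch of the distance expression of Itk.m on hypothetical spectra.
N = 300;
w = linspace(0, pi, N)';
ref_dB  = 10*cos(w);                  % hypothetical reference spectrum (dB)
test_dB = 10*cos(w) + 3*sin(2*w);     % hypothetical test spectrum (dB)

Sr = 1 ./ (10 .^ (ref_dB ./ 10));
S  = 1 ./ (10 .^ (test_dB ./ 10));
Sg = 1 ./ (10 .^ ((test_dB + 6) ./ 10));   % test spectrum with a 6 dB gain offset

% Reference against itself: should be zero
d_self = log(sum(Sr ./ Sr)) + log(sum(Sr ./ Sr)) + log(1/N/N);

% Test spectrum against the reference: positive when the shapes differ
d_test = log(sum(Sr ./ S)) + log(sum(S ./ Sr)) + log(1/N/N);

% Same test spectrum with a gain offset: the two sums scale inversely
d_gain = log(sum(Sr ./ Sg)) + log(sum(Sg ./ Sr)) + log(1/N/N);
```

The gain invariance means that an overall difference in level between two AR spectra does not by itself change this measure; only differences in spectral shape contribute.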
APPENDIX C MATLAB™ CROSS-CORRELATION COEFFICIENTS
% Code implementation obtained from [9] 


% ccdist.m last modified 3/10/94 MPF 

% computes various distances between the AR spectra

% rr is the normalized cross correlation (no DC component present) 

% ur is the normalized cross correlation (includes potential DC effects) 


% Inputs 

% AA is the matrix of AR spectra for the reference group 
% TG is the matrix of AR spectra for the test group 

% n is the number of speakers in the test group


ref0=AA';
x=TG';


% Compute reference model
ref=mean(ref0);
[n,b]=size(x);


% Compute the modified reference model for the modified cross-correlation coefficient
refn=ref-mean(ref);


% Sum over each col. to get norm. ref.
refn=refn+eps*ones(size(refn));


% Compute the cross-correlation coefficients

for i=1:n
xeps(i,:)=x(i,:)+eps*ones(size(x(i,:)));
ur(i)=ref*x(i,:)'/(sqrt(ref*ref'*xeps(i,:)*xeps(i,:)'));
y(i,:)=x(i,:)-mean(x(i,:));
yeps(i,:)=y(i,:)+eps*ones(size(y(i,:)));
rr(i)=refn*y(i,:)'/(sqrt(refn*refn'*yeps(i,:)*yeps(i,:)'));
x(i,:)=xeps(i,:)/sum(xeps(i,:));

end 
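
The difference between the two coefficients in ccdist.m can be seen on a small example. The following is a minimal sketch on made-up spectra (ref, x, and xs are hypothetical row vectors, and the helper names ncc and ncc0 are introduced only for this illustration). The eps terms of the listing are left out for brevity. The sketch shows that the coefficient with the mean removed (rr) is unchanged by a constant offset added to the test spectrum, while the coefficient that keeps the DC component (ur) is not.

```matlab
% Sketch: normalized cross-correlation with and without the DC component.
N = 300;
w = linspace(0, pi, N);
ref = 10*cos(w) - 20;                 % hypothetical reference spectrum (dB)
x   = ref + 3*sin(2*w);               % hypothetical test spectrum (dB)
xs  = x + 5;                          % same shape with a constant offset

% ur-style coefficient: DC component included
ncc  = @(p,q) (p*q') / sqrt((p*p')*(q*q'));
% rr-style coefficient: means removed before correlating
ncc0 = @(p,q) ncc(p - mean(p), q - mean(q));

ur_self = ncc(ref, ref);              % reference against itself
ur_x  = ncc(ref, x);
ur_xs = ncc(ref, xs);                 % changes with the offset
rr_x  = ncc0(ref, x);
rr_xs = ncc0(ref, xs);                % unaffected by the offset
```

A reference spectrum compared with itself gives a coefficient of one in both forms, which is the property used by the code check in the results script.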




APPENDIX D MATLAB™ IMPLEMENTATION OF THE LOG SPECTRAL DISTANCE
% CB.m, Log Spectral Distance 


% Inputs 

% AA is the matrix of AR spectra for the reference group 
% TG is the matrix of AR spectra for the test group 

% nS is the number of speakers in the test group 


% Compute the reference model 
Ref = mean(AA')';


% Compute the log spectral distance
%% Test group
for n= 1:(2*nS) + 1

diff(:,n) = TG(:,n) - Ref;

d(n) = sum(abs(diff(:,n)));


end 
%% Reference group 
for m = 1:32 


diffe(:,m) = AA(:,m) - Ref;
de(m) = sum(abs(diffe(:,m))); 
end 
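
The distance in CB.m is the sum of absolute differences, in dB, between a spectrum and the reference model. The following is a minimal sketch on small made-up matrices (the values of AA and TG are hypothetical, with three frequency points per column). It also applies the decision rule used in the results script, where a test spectrum is flagged when its distance exceeds the largest distance found inside the reference group.

```matlab
% Sketch: city block (log spectral) distance on hypothetical 3-point spectra.
AA = [ 0  2;
       4  6;
      -2  0];                         % reference group, one column per speaker
TG = [ 1  2;
       5  5;
      -1  3];                         % test group, one column per speaker

Ref = mean(AA, 2);                    % reference model as a column vector

% Distance of each test spectrum to the reference model
d = zeros(1, size(TG,2));
for n = 1:size(TG,2)
    d(n) = sum(abs(TG(:,n) - Ref));
end

% Distance of each reference spectrum to the reference model
de = zeros(1, size(AA,2));
for m = 1:size(AA,2)
    de(m) = sum(abs(AA(:,m) - Ref));
end

% Flag test spectra that lie farther away than any reference spectrum
flagged = find(d > max(de));
```

Here the first test column equals the reference model, so its distance is zero, and only the second column is flagged.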





APPENDIX E MATLAB™ IMPLEMENTATION OF THE BOUNDS MEASURE 


% Bounds.m, Bounds measure 


% INPUTS 

% N = length of the AR spectra sequences considered 

% AA is a matrix of AR spectra from the reference group 
% TG is a matrix of AR spectra from the test group 

% Tg is the number of speakers in the test group 

Tg = nS;


% Calculate the Bounds
for n = 1:N
lb(n,:) = min(AA(n,:));
ub(n,:) = max(AA(n,:));
end


% Calculate the Percentage of AR spectra that is outside of the bounds
for ep = 1:2*Tg+1
Ep = TG(:,ep);
ebu(:,ep) = ub - Ep;
ebl(:,ep) = Ep - lb;
ce(ep,:) = size(find(ebu(:,ep)<0 | ebl(:,ep)<0));
end 
cce = ce(:,1); 


for ne = 1:Tg 
chke = ce(ne); 
if chke ~= 0 
faile(ne) = ne; 
end,end 


espk = ce(:,1); 
for fne = 1:length(espk); 

pespk(fne) = 100*(ce(fne)/(2*N));
end


% Calculate the speakers that are outside the bounds by more than five percent
Tsg = zeros(size(pespk));

Eng_test_bnd = find(pespk>5);

Tsg(Eng_test_bnd) = ones(size(Eng_test_bnd)); 
% Calculate the speakers that are outside the bounds by more than five percent for both
% times a word has been recorded.
for n = 1:Tg
if Tsg(n) == 1
if Tsg(n) == Tsg(n+15)
TSG(n) = n; 
else 
TSG(n) = 0; 
end 
else 
TSG(n) = 0; 
end 
end 


clear n 


% Fail_Bnd holds the results of a failed bounds measure
for n = 1:Tg

Fail_Bnd = find(TSG(1:Tg)>0);
end
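
The counting step of the bounds measure can be written compactly. The following is a minimal sketch on small made-up matrices (the values of AA and TG are hypothetical, with four frequency points per column). It follows the listing in dividing the count by 2*N and in using a five percent threshold, but it does not reproduce the later step that requires both recordings of a word to fail.

```matlab
% Sketch: fraction of each test spectrum lying outside the reference bounds.
AA = [1 3;
      2 4;
      0 2;
      5 5];                           % reference group, one column per speaker
TG = [2 4;
      3 1;
      1 1;
      5 6];                           % test group, one column per speaker

[N, nTest] = size(TG);
lb = min(AA, [], 2);                  % lower bound at each frequency point
ub = max(AA, [], 2);                  % upper bound at each frequency point

% Count the points of each test spectrum that fall outside [lb, ub]
count = zeros(1, nTest);
for ep = 1:nTest
    Ep = TG(:,ep);
    count(ep) = sum(Ep > ub | Ep < lb);
end

pct = 100 * count / (2*N);            % same normalization as the listing
flagged = find(pct > 5);              % spectra outside by more than five percent
```

In this example the first test column stays inside the bounds at every point, while the second leaves them at three of four points and is flagged.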


APPENDIX F MATLAB™ IMPLEMENTATION OF THE RESULTS 
% Results.m, Results calculation 


% Inputs 

% N is the length of the AR spectra considered 

% AA is the matrix of the AR spectra for all native English speakers

% BB is the matrix of the AR spectra for all non-native English speakers
% TG is the matrix of the AR spectra for the test group

% ff is the frequency upper limit considered for the AR spectra

% g is a text string which represents a word on the word list considered
% nS is the number of speakers tested

N = 300;

ff = 2400;

nS = 21;

xax = (0:ff/(N-1):ff); 


% Cut the AR spectra from 512 to N length 
CC = AA(1:N,:); clear AA
BB = BB(1:N,:);


% Set up test and reference groups
%% First sixteen reference group
AA = [CC(:,1:16), CC(:,32:47)];


% Compute the reference model

REF = mean(AA');

%% Test group associated to reference group selected and the reference model
TG = [CC(:,17:31), BB(:,1:6), CC(:,48:62), BB(:,7:12), REF'];


% Run performance measures


Itk 

CB 

Bnd 

ccdist 
urte = ur;
rrte = rr;
TG=AA;

ccdist
ure = ur;
rre = rr;
% Check Code

%% Itakura reference should equal zero
IdSdSr = fliplr(dSdSr);

Itakura_ref = IdSdSr(1);


%% Cross-correlation coefficients references should equal one
CCrrf = fliplr(rrte);

CCurf = fliplr(urte);

CrosRef = [CCrrf(1) CCurf(1)];


%% City block metric reference should equal 0
DifRef = fliplr(d);
Diff_reff = DifRef(1);


% Set-up check
CHECK = round([Itakura_ref Diff_reff CrosRef(1) CrosRef(2)]);
CHECKcheck = [0 0 1 1];


if CHECK == CHECKcheck
[g,' CHECKS GOOD']
else
[g,' CHECKS BAD']
end


% Display Bound Results 
Fail_Bnd 


% Display Itakura distance results
Imax = max(dSrdSr); 
Fail_Ik = find(dSdSr>Imax) 


% Display cross-correlation coefficients results 
CrCor_ur = find(urte<min(abs(ure))) 
CrCor_rr = find(rrte<min(abs(rre))) 


% Display city block metric results 
CB = find(d>max(de)) 